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Abstract. We show how to improve the efficiency of the computation 
of fast Fourier transforms over F p where p is a word-sized prime. Our 
main technique is optimisation of the basic arithmetic, in effect decreas- 
ing the total number of reductions modulo p. We give performance 
results showing a significant improvement over Shoup's NTL library. 



1. Introduction 

An important problem in computational number theory and cryptography 
is the efficient implementation of modular arithmetic. A typical implemen- 
tation strategy is to represent elements of Z/iVZ by residues in a standard 
interval, such as [0, N) or [— N/2, N/2), and to reduce intermediate results 
to this interval after each operation in Z/iVZ, such as addition or multipli- 
cation. 

In many algorithms one can obtain a substantial constant factor speedup 
by delaying the modular reduction until after several arithmetic operations 
have been performed, by taking into account the bit-size of intermediate 
results. For example, to compute a dot product J2i a ibii a fundamental 
operation in linear algebra, one may accumulate terms in batches, using or- 
dinary integer (or even floating-point) arithmetic, and perform the reduction 
modulo N after each batch pQ. 

The aim of this paper is to describe a strategy for reducing the number of 
modular reductions in the computation of a discrete Fourier transform over 
a finite field, also known as a number-theoretic transform (NTT). The NTT 
has a vast range of applications; we mention here only fast multiplication of 
large integers or polynomials [5]. 

For simplicity we restrict to the following situation. Let f3 describe the 
machine word size, for example (3 = 2 32 or /3 = 2 64 . We work over F p = 
Z/pZ where p is a word-sized prime, and we assume that p = 1 (mod L), 
where L = 2 e is the transform length. This ensures that F p contains a 
primitive L-th root of unity; we fix one of these, denoted uj. The NTT is 
the map (F P ) L — > (F p ) L defined by 

L-l 



bj = u^ai, < j < L. 



i=0 



i 
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Equivalently, this is the map that evaluates the polynomial a® + a\x + • • • + 
cll-\x l _1 at the points 1, to, u 2 , . . . , . 

It is well known that the fast Fourier transform (FFT) can be used to 
compute the NTT using 0(L log L) operations in F p . For completeness, a 
simple in-place iterative radix-2 FFT algorithm is shown in Algorithm Q3 



Algorithm 1: Simple FFT 



Input: oj G F p and x = (xq, . . . ,xl-i) G (P p ) L 
Output: DFT of x with respect to oj, in bit-reversed order 
l for i «— 1,2, . . . ,£ do 

C <- OJ 2 

for < j < T do 

t <— 2jm 

for < k < m do 













■Kt+k+m 


<— 


C ( x t+k ~ x t+k+m 



end 



end 



10 end 



Our main focus in this paper is on the butterfly operation in line El which 
computes (X,Y) (-)■ ( X + Y, W ( X — Y ) ) for some fixed root of unity W G F p 
(a suitable power of u>). At first glance this requires one modular addition, 
one modular subtraction, and one modular multiplication, and this is how 
the butterfly is usually implemented. Of course these modular operations 
are themselves implemented on modern microprocessors using more basic 
primitives. For example, a modular addition is usually implemented as an 
ordinary integer addition, followed by a comparison with p, followed by a 
conditional subtraction. In this paper we investigate the butterfly at this 
lower level, showing how to streamline the implementation to reduce the 
number of comparisons and conditional operations. Another interpretation 
is that we have reduced the number of modular reductions. 



2. A TYPICAL BUTTERFLY IMPLEMENTATION 

Victor Shoup's NTL (Number Theory Library) [4] is a popular C++ li- 
brary used in computational number theory. It makes heavy use of the NTT. 
Its implementation of the butterfly (X,Y) \-t (X + Y,W(X — Y)) may be 
expressed by the pseudocode shown in Algorithm [2j It has been simplified to 
focus attention on the arithmetic aspects, ignoring issues like loop unrolling, 
software pipelining, and locality. All variables represent register-sized quan- 
tities. 



FASTER ARITHMETIC FOR NUMBER-THEORETIC TRANSFORMS 



3 



Algorithm 2: NTL butterfly implementation 
Constants: p < 8/2 

0<W <p 

W' = [W/3/p\, < W <8 
Input: < X < p 
< Y < p 
Output: X' = X + Y (mod p) 

Y' = W(X -Y) (mod p) 

1 X' 4r- X + Y 

2 if X' > p then X'i-X'-p 

3 T <- X -y 

4 if T < then T ^T + p 

5 q <- L^'r//3j 

6 y' «- (WT - Qp) mod /3 

7 if y' > p then Y' <- Y' -p 

8 return X' , Y' 



Lines HH2] compute the sum X + Y and reduce it modulo p to the standard 
interval [0,p), using the assumption p < (3/2 to avoid overflow in the first 
line. Lines [3HH compute T = X — Y (mod p) in the same way, assuming 
that T has a signed type for the comparison. 

Lines EH3 compute the product WT (mod p). Line [5] first generates an 
estimated quotient Q. The expression [W'T/f3\ represents the high word of 
the product W'T. We assume that this high word is computed efficiently 
using a hardware multiply instruction. By the definition of W' and Q we 
have 

WB , W'T 
<ll£_W<i o < - Q < 1. 
p B 

Multiplying by Tp/B and p respectively, and adding, yields 

Tv 

< WT - Qp < + p < 2p < B. 

In particular, line E] correctly computes Y' = WT — Qp, and the single 
correction in line [7] suffices to reduce it into [0,p). 

The modular multiplication algorithm in lines [5HZ1 is attributed to Shoup 
in [2], but does not seem to have been published. It first appears in NTL 
version 5.4 in 2005. The use of a suitable precomputed approximation to 
W/p implies that only a single correction step (line [7]) is necessary, and that 
the remainder is obtained using only multiplication modulo B (line [6]), an 
advantage on processors that can compute the low word faster than the full 
product. 
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In our exposition we assumed for simplicity that p < (3/2, but Niels Moller 
has pointed out (personal communication) that this can be improved. The 
modular subtraction can be made to work for any p < (3 by replacing the 
condition T < by X < Y, or indeed by using the borrow generated by the 
subtraction T = X — Y . The addition can be treated similarly by rewriting 
it as X' = X — (p — Y). A more careful analysis of the candidate remainder 
WT — Qp then shows that the entire algorithm works for any p < /3/<f> where 
0= 1(1 + ^5) « 1.618. 

3. A MODIFIED BUTTERFLY 

In this section we propose several modifications to Algorithm [2 Our 
motivation is that the adjustment steps in lines O [5] and [7] are relatively 
expensive on modern microprocessors, compared to hardware integer multi- 
pliers, which in recent years have become very fast. 

Our first observation is that Shoup's algorithm for computing WT (mod p) 
does not require that < T < p; in fact the proof given above shows that 
it works for any < T < (3. Therefore we may replace lines [SHU by simply 
T <— X — Y + p, after which we have < T < 2p < (3, and the algorithm 
proceeds as before. This simplification is not new, although it does not ap- 
pear to be well known. It is not used in NTL. It was apparently used by 
Fabrice Bellard in a computation of ir to 2.7 trillion decimal places in 2009 
(personal communication) . 

Our second observation appears to be new, and is the main novelty intro- 
duced in the present paper. Namely, we may also remove the adjustment in 
line provided that throughout the FFT we use a redundant representa- 
tion for elements of F p . That is, instead of representing elements of F p by 
integers in [0,p), we use the wider interval [0,2p), so each element has two 
possible representations. For this to work, we must modify the butterfly to 
accept inputs in [0, 2p), and we must impose the stronger condition p < (3 /A. 
Pseudocode for the resulting butterfly is shown in Algorithm [3j 

According to Jason Papadopolous (personal communication), around 1998 
Ernst Mayer suggested the use of a redundant representation in the context 
of a fast Galois transform, i.e. a Fourier transform over F„2 where q = 2 61 — 1. 
Our new algorithm may be regarded as a generalisation of this idea to the 
case of an NTT with arbitrary modulus p. 

The proof of correctness is much the same as before. Lines HH21 compute 
a representative for X + Y (mod p) in the interval [0, 2p). Line [3] computes 
a representative T for X — Y (mod p) in the interval [0, 4p); this does not 
overflow as p < (3/4. Lines HHS] compute a representative for WT (mod p) in 
the interval [0, 2p). Both outputs X' , Y' lie in [0, 2p), ready to be processed 
by a subsequent butterfly. 

Depending on the needs of the NTT user, it may be necessary to normalise 
the final NTT output into the interval [0,p), imposing an additional 0{L) 
cost. 
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Algorithm 3: Modified Shoup butterfly- 
Constants: p < /3/4 

< W <p 

W' = [W/3/p\, < W </3 
Input: < X < 2p 
< Y < 2p 
Output: X' = X + Y (mod p), < X' < 2p 

Y' = W(X - Y) (mod p), < Y' < 2p 

1 X' <- X + Y 

2 if X' > 2p then X' <- X' - 2p 

3 T <K X - Y + 2p 

4 Q <- |W'T/£J 

5 Y' <- (TYT - Qp) mod /3 

6 return X', Y' 



4. Variants 

The same idea may be applied to butterflies using other modular multi- 
plication algorithms. Algorithm |4] shows a variant using Montgomery mul- 
tiplication [3]. 



Algorithm 4: Modified Montgomery butterfly 
Constants: p odd, p < /3/4 
< W <p 

W = PW (mod p), < W < p 
J = p^ 1 (mod 0) 
Input: < X < 2p 
< Y < 2p 
Output: X' = X + Y (mod p), < X' < 2p 

Y' = W(X - Y) (mod p), < Y' < 2p 

1 X' ^ X + Y 

2 if X' > 2p then X' <- X ' - 2p 

3 T ^ X -Y + 2p 

4 iii/3 + R <- W'T 

5 Q <— f? J m od /3 

6 if <- |_Qp//3J 

7 Y> <- R 1 -H + p 

8 return X', Y' 



6 



DAVID HARVEY 



Lines HH31 are identical to the corresponding lines in Algorithm [3l Line H] 
computes the product W'T, placing the low and high words of the result in 
Rq and R\ respectively, so that < R\ < p. Line [5] computes Q = Ro/p 
(mod /3) (this may be regarded as a 2-adic approximation to the quotient 
W'T/p). Line[6] computes the high word of Qp, so that < H < p. We have 
Qp = R (mod /3), so Qp = H/3 + R . Then W'T - Qp = p(Ry - H), and 
thus WT = R\ — H (mod p). This agrees modulo p with the value computed 
for Y' in line which lies in the interval [0, 2p). Usually in Montgomery's 
algorithm, a further comparison and conditional subtraction (or addition) 
is performed to normalise the result into [0,p), but we have simply skipped 
that. 

Compared to Algorithm [3l this butterfly algorithm has the advantage 
that only a single value W must be stored in a table for each root of unity 
(W is not actually used in the algorithm). On the other hand, one of the 
multiplications modulo (3 has been replaced by a full product, which may 
be more expensive on some processors. 

In some situations one requires a butterfly of the form (X, Y) \— > (X + 
WY, X—WY). This appears if one switches from a 'decimation-in-frequency' 
transform to a 'decimation-in-time' transform. It also appears naturally in 
the inverse FFT obtained by running Algorithm [1] backwards. Algorithm [5] 
shows an analogue of Algorithm [3] for this case. The main difference is that 
elements of F p are represented by an integer in [0, 4p), so each residue has 
four possible representatives. 

Algorithm 5: Modified inverse butterfly 
Constants: p < /3/4 

< W <p 

W' = [W/3/p\, < W </3 
Input: < X < 4p 
< Y < \p 

Output: X' = X + WY (mod p), < X' < 4p 
Y' = X - WY (mod p), < Y' < 4p 

l if X > 2p then X <- X - 2p 

2Qf- [W'Y/p\ 

3 T <- (WY - Qp) mod /3 

4 X' <- X + T 

5 Y' <- X - T + 2p 

6 return X' , Y' 

Line [H reduces X into [0, 2p), lines [2H3] compute T = WY (mod p) in 
[0, 2p), and the remaining lines complete the calculation of X' and Y' in 
[0, Ap). As in Algorithm [3l this strategy saves two modular reductions com- 
pared to the usual implementation. 
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5. Implementation and performance 

The practical benefit (if any) derived from the new algorithms depends 
heavily on the hardware used. To illustrate what may occur in practice, we 
performed some timing experiments on a 3.06GHz Intel Xeon (model X5675, 
'Westmere' microarchitecture), a modern 64-bit processor. 

We compared a C implementation of a number-theoretic transform in- 
corporating Algorithm O a similar implementation of a transform based on 
Algorithm^ and the FFT routine in NTL itself (version 5.5.2). The code is 
available from the author's web site. It is moderately optimised: the inner 
loops are 2-way unrolled, and the last two layers of the FFT have dedicated 
loops. The values of W and W are precomputed and stored in a table. 
Everything was compiled using GCC 4.6.3, with the -02 optimisation flag. 
Access to the high word of the product of two 64-bit integers is obtained 
using the compiler's built-in support for 128-bit integer types. 

We ran transforms of length 2 11 = 2048. This is long enough to avoid 
too much loop overhead, but short enough that all memory accesses hit the 
LI cache, so we may ignore locality problems. For our code we used a 62- 
bit prime p, and for NTL we used a 50-bit prime (NTL supports moduli 
of only up to 50 bits on a 64-bit machine, for historical reasons related to 
floating-point arithmetic). 

The NTL FFT routine also performs a bit-reversal of the input array and 
computes a table of roots of unity on each FFT invocation. To make a fair 
comparison with our code, which does not perform these steps, we removed 
their contribution in the timing data shown below. 

Table 1. Cycles per butterfly for several implementations 



NTL 


12.0 


(entire FFT) 


Algorithm [2] 


10.1 


(entire FFT) 




11.0 


(inner loop) 


Algorithm [3] 


6.9 


(entire FFT) 




7.5 


(inner loop) 



For the lines labelled 'entire FFT', we measured the total time (CPU 
clock cycles) taken by the FFT, and divide by the number of butterflies 
performed, which is 11 x 2 10 . Note that O(L) of the O(LlogL) butterflies 
have W = 1, and these are faster than the general butterfly with W ^ 1. 
For the lines labelled 'inner loop', we measured the speed of the main inner 
loop, which does not take advantage of any occurrences of W = 1. 

Two features of the table are worth pointing out. First, our implemen- 
tation of Algorithm [2] is competitive with NTL. In fact, it is slightly faster. 
We do not know the precise reason for this. The C compiler has many 
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choices in how it generates the machine code, and in some cases may gen- 
erate suboptimal code. We may have simply been luckier than NTL in this 
case. 

Second, Algorithm[3]outperforms Algorithm [2] by a factor of almost 1.5. In 
favourable circumstances, the Westmere processor can sustain a maximum 
throughput of one 64-bit multiplication every 2 cycles. If we assume that a 
butterfly requires three such multiplications, then the best we can expect is 
6 cycles per butterfly. The 7.5 cycles reported above comes fairly close to 
this. We expect that a careful assembly implementation would get closer, 
and possibly even reach exactly 6, with all other operations executed in 
parallel with the multiplications. It seems unlikely that Algorithm [2] could 
achieve this speed. 
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